A big.matrix consists of an object in R that does nothing more than point to
the data structure implemented in C++. The object acts much like a traditional
R matrix, but helps protect the user from many inadvertent memory-consuming
pitfalls of traditional R matrices and data frames. There are two big.matrix
types, which manage data in different ways. A standard, shared big.matrix is
constrained to available RAM, and may be shared across separate R processes.
A file-backed big.matrix may exceed available RAM by using hard drive space,
and may also be shared across processes. The atomic types of these matrices
may be double, integer, short, or char (8, 4, 2, and 1 bytes, respectively).
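
As a minimal sketch of the shared behavior described above, the following creates a small in-memory big.matrix and attaches a second handle to it through its descriptor. It assumes the bigmemory package is installed. Both handles live in one R session here purely for illustration; in practice the descriptor would be passed to a separate R process.

```r
library(bigmemory)

# A 3 x 3 shared big.matrix of 4-byte integers, initialized to 3.
x <- big.matrix(3, 3, type = "integer", init = 3)

# The descriptor carries what another process needs to find the same data.
desc <- describe(x)

# Attaching yields a second R object pointing at the same C++ structure.
y <- attach.big.matrix(desc)

# A write through one handle is visible through the other.
y[1, 1] <- -100L
x[1, 1]
```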

If x is a big.matrix, then x[1:5,] is returned as an R matrix containing the
first five rows of x. If x is of type double, then the result will be numeric;
otherwise, the result will be an integer R matrix. The expression x alone will
display information about the R object (e.g. the external pointer) rather than
evaluating the matrix itself (the user should try x[,] with extreme caution,
recognizing that a huge R matrix will be created).
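
The extraction rules above can be checked on a small object. This is an illustrative sketch that assumes bigmemory is installed; the dimensions and initial values are arbitrary.

```r
library(bigmemory)

xi <- big.matrix(10, 3, type = "integer", init = 1)
xd <- big.matrix(10, 3, type = "double", init = 0.5)

# Extracting rows copies them into an ordinary R matrix.
sub_i <- xi[1:5, ]
sub_d <- xd[1:5, ]

is.matrix(sub_i)   # an ordinary R matrix, no longer a big.matrix
dim(sub_i)         # five rows, three columns
typeof(sub_i)      # storage mode of the extracted copy
typeof(sub_d)
```

Only the extracted copy occupies ordinary R memory, which is why x[,] on a very large object is risky while x[1:5,] is not.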

If x has a huge number of rows and/or columns, then the use of rownames and/or
colnames will be extremely memory-intensive and should be avoided. If x has a
huge number of columns and separated=TRUE is used (this isn't typically
recommended), the user might want to store the transpose, as there is the
overhead of a pointer for each column in the matrix.

If separated is TRUE, then the memory is allocated into separate vectors for
each column. Use this option with caution if you have a large number of
columns, as shared-memory segments are limited by OS and hardware combinations.
If separated is FALSE, the matrix is stored in traditional column-major format.
The function is.separated() reports whether a big.matrix uses separated column
storage.
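
A short sketch of the two storage layouts, assuming bigmemory is installed. The matrices are tiny, so the per-column overhead mentioned above is negligible here.

```r
library(bigmemory)

# One allocation per column.
a <- big.matrix(3, 2, type = "double", init = 0, separated = TRUE)

# A single column-major block (the default).
b <- big.matrix(3, 2, type = "double", init = 0)

is.separated(a)
is.separated(b)
```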

When a big.matrix, x, is passed as an argument to a function, it essentially
provides call-by-reference rather than call-by-value behavior. If the function
modifies any of the values of x, the changes are not limited in scope to a
local copy within the function. This introduces the possibility of
side-effects, in contrast to standard R behavior.
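
The contrast with ordinary R matrices can be sketched as follows. The helper zero_first_row() is hypothetical and exists only for this illustration; bigmemory is assumed to be installed.

```r
library(bigmemory)

# Hypothetical helper: overwrites the first row of whatever it is given.
zero_first_row <- function(m) {
  m[1, ] <- 0
  invisible(NULL)
}

x <- big.matrix(2, 2, type = "double", init = 1)  # big.matrix
r <- matrix(1, nrow = 2, ncol = 2)                # ordinary R matrix

zero_first_row(x)
zero_first_row(r)

x[1, ]  # modified: the function wrote to the shared data
r[1, ]  # unchanged: the function only modified its local copy
```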

A file-backed big.matrix may exceed available RAM in size by using a file
cache (or possibly multiple file caches, if separated=TRUE). This can incur a
substantial performance penalty for such large matrices, but less of a penalty
than most other approaches for handling such large objects. Creating a
file-backed object produces not only the backing file(s) but also a descriptor
file (in the same directory), which is needed for subsequent attachments (see
attach.big.matrix).
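
The following sketch creates a small file-backed big.matrix in a temporary directory and re-attaches it from its descriptor file. The file names example.bin and example.desc are placeholders chosen for this illustration, and bigmemory is assumed to be installed. How attach.big.matrix resolves a path may vary between package versions; passing the full path to the descriptor file is the assumption made here.

```r
library(bigmemory)

# A scratch directory holding both the backing file and the descriptor.
dir <- tempfile("bm_example_")
dir.create(dir)

x <- filebacked.big.matrix(4, 2, type = "double", init = 0,
                           backingfile = "example.bin",
                           backingpath = dir,
                           descriptorfile = "example.desc")
x[1, 1] <- 3.5

# Both files are written alongside each other.
list.files(dir)

# Re-attach through the descriptor file, as a later session would.
y <- attach.big.matrix(file.path(dir, "example.desc"))
y[1, 1]
```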

Note that we do not allow setting or changing the dimnames attributes by
default; such changes would not be reflected in the descriptor objects or in
shared memory. To override this, set options(bigmemory.allow.dimnames=TRUE).
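
A brief sketch of the override, assuming bigmemory is installed. The column names are arbitrary, and the previous option value is restored afterwards.

```r
library(bigmemory)

x <- big.matrix(4, 2, type = "integer", init = 0)

# Enable dimnames changes for this session, remembering the old setting.
old <- options(bigmemory.allow.dimnames = TRUE)

colnames(x) <- c("alpha", "beta")
colnames(x)

# Restore the previous setting.
options(old)
```

As the text notes, names set this way are not propagated to descriptors that were created earlier.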

A user can also create an "anonymous" file-backed big.matrix by specifying ""
as the backingfile argument. In this case, the backing resides in the temporary
directory and a descriptor file is not created. Anonymous backings should be
used with caution, since they still use disk space and could eventually fill
the hard drive. They are removed either manually, by a user, or automatically,
when the operating system deems it appropriate.

Finally, note that as.big.matrix can coerce data frames. It does this by
making any character columns into factors, and then making all factors numeric
before forming the big.matrix. Level labels are not preserved and must be
managed by the user if desired.
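
One way to manage the level labels yourself is sketched below. It assumes bigmemory is installed and that the coercion uses the default (alphabetical) factor level ordering described above. The data frame and its column names are invented for illustration, and the coercion warning is suppressed only to keep the output quiet.

```r
library(bigmemory)

df <- data.frame(id = 1:3, grp = c("x", "y", "x"),
                 stringsAsFactors = FALSE)

# Record the labels before coercion, since the big.matrix will not keep them.
grp_levels <- levels(factor(df$grp))

# Character column -> factor -> numeric codes.
bm <- suppressWarnings(as.big.matrix(df, type = "integer"))

# Map the stored codes back to the saved labels.
codes <- as.integer(bm[, 2])
recovered <- grp_levels[codes]
recovered
```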